After running the demultiplexer_for_dada2 (http://github.com/ramongallego/demultiplexer_for_dada2), we have to denoise the whole dataset. We will do this using five different processes:
Estimation of tag-jumping or index cross-talk. We run multiple samples on each MiSeq run, identified by two sets of molecular barcodes. There is the potential for some sequences to be assigned to the wrong sample, which is a bummer. To estimate how many reads did this, on each MiSeq run we added some samples whose composition is known and extremely unlikely to be present in the environmental samples studied. As a result of tag-jumping, some of the positive control sequences might show up in the environmental samples and vice versa. In our case, these positive controls are made of either Kangaroo or Ostrich (and Alligator). The process consists of, for each run, modelling the composition observed in the positive controls and subtracting it from the environmental samples of that run. The output will be a dataset with the same number of samples as before, but with fewer reads of certain sequences (ASVs).
Discarding samples with an extremely low number of reads. Sometimes the number of reads sequenced from a particular replicate is really low, and hence the relative proportions of ASVs would be skewed.
Full clearance of positive control influence. This process also takes advantage of the known composition of the positive controls. Each ASV found in the positive controls with a higher abundance there than in the rest of the samples will be labelled as Positive and removed from the environmental dataset. The output will be a dataset with the same number of samples as before, but with fewer ASVs.
Occupancy modelling. Is the presence of an ASV a reflection of a biological reality, or likely a PCR artifact? This may seem trivial in extreme cases (an ASV that appears in only one PCR replicate in the whole dataset), but how do we discriminate PCR artifacts from rare but real organisms? We use occupancy modelling to determine whether the pattern of presence of an ASV across the dataset reflects a true occurrence. The output of this procedure will be a dataset with the same number of samples as before, but with fewer ASVs.
Dissimilarity between PCR replicates. The workflow that leads to the sequencing of a particular sample is subject to many stochastic processes, and it is not unlikely that the composition retrieved is very different from the original community. A way to ensure this difference is minimal is the separate analysis of each PCR replicate. We used that approach and modelled the dissimilarity between each PCR replicate and the group centroid. Modelling the dissimilarity this way allows us to discard those PCR replicates that do not fit the normal distribution of dissimilarities. The output of this procedure will be a dataset with the same number of Hashes as before, but with fewer samples.
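As a toy illustration of the tag-jumping subtraction in step 1 (all sample names and read counts below are made up): if a Hash never exceeds, say, 1% of the reads in any positive control, we treat up to 1% of each environmental sample's depth as potential cross-talk for that Hash and subtract it.

```r
library(tidyverse)

# Made-up run with one positive control and one environmental sample
toy <- tribble(
  ~sample, ~Hash,      ~nReads,
  "Pos_1", "kangaroo",  9900,  # expected in the positive control
  "Pos_1", "fish_A",     100,  # jumped INTO the positive control
  "Env_1", "fish_A",    5000,
  "Env_1", "kangaroo",    50   # jumped OUT into the environment
)

# Maximum proportion each Hash ever reaches in the positive controls
contam <- toy %>%
  filter(str_detect(sample, "Pos")) %>%
  group_by(sample) %>% mutate(prop = nReads / sum(nReads)) %>%
  group_by(Hash) %>% summarise(vector_contamination = max(prop))

# Subtract that proportion of each environmental sample's depth,
# dropping anything that falls to zero or below
cleaned <- toy %>%
  filter(str_detect(sample, "Env")) %>%
  group_by(sample) %>% mutate(Total = sum(nReads)) %>%
  left_join(contam, by = "Hash") %>%
  mutate(nReads = nReads - ceiling(vector_contamination * Total)) %>%
  filter(nReads > 0)

cleaned   # only fish_A survives in Env_1, with slightly fewer reads
```

The kangaroo reads vanish from the environmental sample (they cannot exceed their jump rate), while fish_A only loses the handful of reads that could plausibly have jumped.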
As with everything, we will start the process by loading the required packages and datasets.
We will load the ASV table and the metadata file. They are in the same folder, so we use list.files to access them and a neat combination of map(read_csv) and bind_rows.
all.asvs <- list.files(path = "..", pattern = "^ASV_table_r", recursive = T, full.names = T)
all.metadata <- list.files(path = "..", pattern = "^metadata_r", recursive = T, full.names = T)
all.hashes <- list.files(path = "..", pattern = "^Hash_key_r", recursive = T, full.names = T, ignore.case = T)
ASV.table <- bind_rows(map(all.asvs, read_csv), .id = "Miseq_run")
Parsed with column specification:
cols(
  sample = col_character(),
  Hash = col_character(),
  nReads = col_double()
)
metadata <- bind_rows(map(all.metadata, function(x) {
read_csv(x) %>%
dplyr::select("sample_id", "pri_index_name", "Tag"= "sec_index_name")
}),.id = "Miseq_run")
Parsed with column specification:
cols(
  .default = col_character(),
  distance = col_double(),
  insert_size = col_double(),
  sec_index_start = col_double()
)
See spec(...) for full column specifications.
Hash.key <- bind_rows(map(all.hashes, read_csv))
Parsed with column specification:
cols(
  Hash = col_character(),
  Sequence = col_character()
)
Hash.key %>%
distinct(Hash, .keep_all = T) -> Hash.key
Emily already cleaned up the data. A few things we check for: that no sample appears twice in the metadata; that the metadata uses Tag_01 instead of Tag_1 (so it can be sorted alphabetically); and that the structure Site_YYYYMM[A-C].[1-3] is the same across the dataset.
# Check that no sample appears more than once in the metadata
metadata %>%
group_by(sample_id) %>%
summarise(tot = n()) %>%
arrange(desc(tot)) # Samples only appear once
# We should change Tag_1 for Tag_01
metadata %>%
mutate(Tag = case_when(str_detect(Tag, "\\_[0-9]{1}$") ~ str_replace(Tag, "Tag_", "Tag_0"),
TRUE ~ Tag )) -> metadata
metadata %>% mutate(x= str_count(sample_id, pattern = "_")) %>% arrange(desc(x)) # Max number of underscores is 5
metadata %>% mutate(x= str_count(sample_id, pattern = "_")) %>% arrange((x)) # Min number of underscores is also 5
ASV.table %>% distinct(sample) %>% mutate(x= str_count(sample, pattern = "_")) %>% arrange(desc(x)) # Also in the ASV table
ASV.table %>% distinct(sample) %>% mutate(x= str_count(sample, pattern = "_")) %>% arrange((x)) # Also in the ASV table
Now let’s check the positive controls: are they all kangaroo in each run, or also ostrich?
metadata %>%
filter(str_detect(sample_id, "NA"))
Looking at the metadata: ostriches were used in round 1 and kangaroo in rounds 2-5.
The output of this process is a clean ASV table and a clean metadata file.
Before we modify our datasets in any way, we can calculate how many sequences that were only supposed to be in the positive controls appeared in the environmental samples, and how many did the opposite. First we divide the dataset into positive controls and environmental samples. We also create an ordered list of the Hashes present in the positive controls, for ease of plotting.
ASV.table %>% mutate(source = case_when(str_detect(sample, "NA_") ~ "Positives",
TRUE ~ "Samples")) -> ASV.table
ASV.table %>%
filter (source == "Positives") %>%
group_by(Hash) %>%
summarise(tot = sum(nReads)) %>%
arrange(desc(tot)) %>%
pull(Hash) -> list.of.hashes.in.positive
Now let’s create a jumping vector: what proportion of the reads found in the positive controls came from elsewhere, and what proportion of the reads in the samples came from the positive controls?
To streamline the process and make it easy to execute similarly but independently on each MiSeq run, we nest the dataset by run. So step 1 is to create a nested table so we can run this analysis on each run independently.
ASV.table %>%
group_by(Miseq_run, source) %>%
nest() %>%
spread(source, data) -> ASV.nested
That wasn’t too complicated. Let’s start a summary function that keeps track of our cleaning process
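The `how.many` helper and the starting `ASV.summary` object are used in the next chunks but are not defined above. A minimal sketch consistent with how they are called later (the column names `Stage`, `Stat` and `number` are taken from the final summary chunk; the original implementation may differ):

```r
library(tidyverse)

# Count samples, distinct Hashes and total reads at a given cleaning stage
how.many <- function(ASVtable, round){
  ASVtable %>%
    ungroup() %>%
    summarise(nsamples = n_distinct(sample),
              nHashes  = n_distinct(Hash),
              nReads   = sum(nReads)) %>%
    gather(key = "Stat", value = "number") %>%
    mutate(Stage = as.character(round))
}

# Quick check on a toy two-sample table
toy <- tibble(sample = c("A", "A", "B"),
              Hash   = c("h1", "h2", "h1"),
              nReads = c(10, 5, 20))
how.many(toy, round = "0.Start")
```

`ASV.summary` would then presumably start per run along the lines of `ASV.nested %>% transmute(Miseq_run, Summary = map(Samples, ~ how.many(., "0.Start")))`.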
We create a vector of the composition of each positive control and subtract it from the environmental samples of their runs.
ASV.nested %>%
mutate (contam.tibble = map(Positives,
function(.x){
.x %>% # Take the tibble from each run
group_by(sample) %>%
mutate (TotalReadsperSample = sum(nReads)) %>% # How many reads per sample
mutate (proportion = nReads/TotalReadsperSample) %>% # What proportion does each Hash represent
group_by (Hash) %>%
summarise (vector_contamination = max (proportion)) # What is the maximum proportion that Hash ever gets
}) ) -> ASV.nested
ASV.nested %>%
unnest(contam.tibble) %>%
group_by(Miseq_run) %>%
summarise(nhash = n_distinct(Hash)) # Check how it looks
The idea behind this procedure is that we know, for each run, how many reads from each Hash appeared in the positive controls. These come from two processes: sequences we know should appear in the positive controls, and sequences that have jumped from the environment into the positive controls. With this procedure, we subtract from every environmental sample the proportion of reads that jumped from elsewhere.
ASV.nested %>%
mutate(cleaned.tibble = map2(Samples, contam.tibble, function(.x,.y){ # We need two columns: the sample data and the contamination tibble
.x %>% # Get the samples tibble
group_by (sample) %>%
mutate (TotalReadsperSample = sum (nReads)) %>% # Calculate the number of reads per sample
left_join(.y, by = "Hash") %>% # Add, for each Hash shared with the positives, the value of the contamination vector
#
mutate (Updated_nReads = case_when (!is.na(vector_contamination) ~ (as.numeric(nReads) - (ceiling(vector_contamination*TotalReadsperSample))),
TRUE ~ as.numeric(nReads))) %>% # If the Hash is present in the positives, subtract the contamination proportion times the total sample depth; otherwise leave it as it is
filter (Updated_nReads > 0) %>% # Keep only values > 0
ungroup() %>%
dplyr::select (sample, Hash, nReads = Updated_nReads) # Leave no trace
})) -> ASV.nested
ASV.nested %>%
unnest(cleaned.tibble) #Check how they look
Add this step to the summary table we were creating
ASV.nested %>%
transmute(Miseq_run, Summary.1 = map(cleaned.tibble, ~ how.many(ASVtable = .,round = "1.Jump"))) %>%
left_join(ASV.summary) %>%
mutate(Summary = map2(Summary, Summary.1, bind_rows)) %>%
dplyr::select(-Summary.1) -> ASV.summary
Joining, by = "Miseq_run"
# Check how it looks
ASV.summary %>%
unnest(Summary)
NA
We will fit the number of reads assigned to each sample to a normal distribution and discard those samples that fall in the lower 5% tail of that distribution. The output will be a dataset with fewer samples and potentially fewer unique Hashes.
ASV.nested %>%
unnest(cleaned.tibble) %>%
group_by(sample) %>%
summarise(tot = sum(nReads)) -> all.reps
# Visualize
all.reps %>%
pull(tot) -> reads.per.sample
names(reads.per.sample) <- all.reps %>% pull(sample)
normparams.reads <- MASS::fitdistr(reads.per.sample, "normal")$estimate # Fit the number of reads per sample to a normal distribution
all.reps %>%
mutate(prob = pnorm(tot, normparams.reads[1], normparams.reads[2])) -> all.reps # Probability of each depth coming from the same distribution
outliers <-
all.reps %>%
filter(prob < 0.05 & tot < normparams.reads[1]) # Which samples fall in the lower 5% tail
outliers %>%
arrange(tot) # Some have as few as 4k
outliers %>%
arrange(desc(tot)) # Max number of reads of a sample removed is 20k
ASV.nested %>%
mutate(Step.1.low.reads = map (cleaned.tibble, ~ filter(.,!sample %in% outliers$sample) %>% ungroup)) -> ASV.nested # Remove outliers
ASV.nested %>%
transmute(Miseq_run, Summary.1 = map(Step.1.low.reads, ~ how.many(ASVtable = .,round = "2.Low.nReads"))) %>%
left_join(ASV.summary) %>%
mutate(Summary = map2(Summary, Summary.1, bind_rows)) %>%
dplyr::select(-Summary.1) -> ASV.summary
Joining, by = "Miseq_run"
Removing the Hashes that belong to the positive controls. First, for each Hash that appeared in the positive controls, we determine whether a sequence is a true positive or truly environmental. For each Hash, we will calculate the maximum, mean and total number of reads in both positives and samples, and then we will use the following decision tree:
If all three statistics are higher in one of the groups, we will label it as either Environmental or Positive control influence.
If there are conflicting results, we will use the maximum abundance as the tie-breaker: if the maximum abundance of a Hash is in a positive control, then it is a positive; otherwise it is a real sequence from the environment.
Now, for each Hash in each set of positive controls, we calculate the proportion of reads that were misassigned, i.e. that appeared somewhere they were not expected. A second step would be to create a column named proportion switched, stating the proportion of reads from one Hash that jumped from the environment to a positive control or vice versa. The idea is that any presence below a threshold can arguably be attributed to tag-jumping.
ASV.table %>% # Taking the whole dataset
filter (Hash %in% list.of.hashes.in.positive) %>% # keep only those hashes that have shown at least once in a positive control
group_by(sample) %>%
mutate(tot.reads = sum(nReads)) %>%
group_by(Hash,sample) %>%
mutate(prop = nReads/tot.reads) %>%
group_by(Hash, source) %>%
summarise (max. = max(prop),
mean. = mean(prop),
tot. = sum(nReads)) %>%
gather(contains("."), value = "number", key = "Stat") %>%
spread(key = "source", value = "number", fill = 0) %>%
group_by(Hash, Stat) %>%
mutate(origin = case_when(Positives > Samples ~ "Positive.control",
TRUE ~ "Environment")) %>%
group_by (Hash) %>%
mutate(tot = n_distinct(origin)) -> Hash.fate.step2
# Do all three measurements agree?
Hash.fate.step2 %>%
filter(tot != 1 )
# No - remove only those for which the three measurements agree
Hash.fate.step2 %>%
filter(tot == 1) %>%
group_by(Hash) %>%
summarise(origin = unique(origin)) %>%
filter(origin == "Positive.control") -> Hashes.to.remove.step2
ASV.table %>%
group_by(source, Hash) %>%
summarise(ocurrences =n()) %>%
spread(key = source, value = ocurrences, fill = 0) %>%
#left_join(Hashes.to.remove.step2) %>%
#mutate(origin = case_when(is.na(origin) ~ "Kept",
# TRUE ~ "Discarded")) %>%
mutate(second.origin = case_when(Positives >= Samples ~ "Discarded",
TRUE ~ "Kept")) %>%
filter(second.origin == "Discarded") %>%
full_join(Hashes.to.remove.step2) -> Hashes.to.remove.step2
Joining, by = "Hash"
In order to train DADA2 to better distinguish when positive control sequences have arrived in the environment, we will keep the sequences in a csv file.
Hashes.to.remove.step2 %>% left_join(Hash.key) %>% select(Hash, Sequence) %>% write_csv(".to.remove.csv")
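The chunk that actually drops these Hashes from each run does not appear above, but the object it creates, Step2.tibble, is used by the occupancy step below. A sketch consistent with the surrounding chunks (the stage label 3.Positives is guessed from the numbering of the other stages; the original chunk may differ):

```r
# Remove every Hash flagged as positive-control influence from each run
ASV.nested %>%
  mutate(Step2.tibble = map(Step.1.low.reads,
                            ~ filter(., !Hash %in% Hashes.to.remove.step2$Hash))) -> ASV.nested

# And log the effect in the running summary, as with the previous steps
ASV.nested %>%
  transmute(Miseq_run, Summary.1 = map(Step2.tibble, ~ how.many(ASVtable = ., round = "3.Positives"))) %>%
  left_join(ASV.summary) %>%
  mutate(Summary = map2(Summary, Summary.1, bind_rows)) %>%
  dplyr::select(-Summary.1) -> ASV.summary
```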
What is the probability of a true positive presence of a Hash in a MiSeq run? We will use eDNA occupancy modelling to assess whether a Hash is a rare variant that spilled over or a true presence.
The process requires loading extra packages, creating a model file, and grouping the Hashes by run and biological replicate, summarising the data in a presence/absence format.
The occupancy model itself was run in the Rmarkdown file Rjags.tunning.Rmd, so here we will load the csv file that contains the probability of occurrence of every Hash per site. Each Hash-Site combination produces a matrix of presences and absences that feeds the model: in some cases it is a 30x3 matrix, in others a 39x3. We summarised the number of occurrences in each case and ran a model only for each unique case (to save computing time). Each unique model was run 10 times to filter out cases in which the model converged on a local maximum.
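As a toy sketch of that presence/absence summary (made-up data; in the real pipeline each Hash and site combination would feed the occupancy model in Rjags.tunning.Rmd):

```r
library(tidyverse)

toy <- tribble(
  ~original_sample, ~sample,    ~Hash, ~nReads,
  "CI_Jul",         "CI_Jul.1", "h1",  100,
  "CI_Jul",         "CI_Jul.2", "h1",   80,
  "CI_Jul",         "CI_Jul.3", "h2",    5
)

# For each Hash and biological sample: in how many of the three PCR
# replicates was the Hash detected at all?
pa <- toy %>%
  group_by(original_sample, Hash) %>%
  summarise(ndetections = n_distinct(sample), .groups = "drop") %>%
  mutate(nreplicates = 3)

pa
```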
So we will import the object Occ.fate.csv and reduce the dataset to those Hashes with an occupancy probability > 0.8.
occ.results <- read_csv("Occ.fate.csv")
Parsed with column specification:
cols(
  Hash = col_character(),
  model = col_double()
)
occ.results %>%
ggplot(aes(x = model)) +
geom_histogram()
occ.results %>%
left_join(ASV.nested %>%
unnest(Step2.tibble) %>%
group_by(Hash) %>%
summarise (tot = sum(nReads))) %>%
ggplot(aes(x = cut_interval(model, n = 20))) +
geom_col(aes(y = tot), fill = "red")
Joining, by = "Hash"
So we will throw away most of the Hashes but keep most of the reads - we are onto something here.
occ.results %>%
filter(model > 0.8) %>%
pull (Hash) -> to.keep
ASV.nested %>%
mutate(Step3.tibble = map (Step2.tibble, ~ filter(.,Hash %in% to.keep))) -> ASV.nested
ASV.nested %>%
transmute(Miseq_run, Summary.1 = map(Step3.tibble, ~ how.many(ASVtable = .,round = "4.Occupancy"))) %>%
left_join(ASV.summary) %>%
mutate(Summary = map2(Summary, Summary.1, bind_rows)) %>%
dplyr::select(-Summary.1) -> ASV.summary
Joining, by = "Miseq_run"
So, a second way of cleaning the dataset is to remove samples for which the dissimilarity between PCR replicates exceeds the normal distribution of dissimilarities. Sometimes the preparation of a PCR replicate goes wrong for a number of reasons, leading a particular PCR replicate to be substantially different from the other two. In that case, we will remove the PCR replicate that has the highest dissimilarity with the other two.
The process starts by adding the biological information to the ASV table, then dividing the dataset by biological replicate. This will also remove any sample that is not included in the metadata, e.g. coming from a different project.
ASV.nested %>%
unnest(Step3.tibble) %>%
separate(sample, into = c(1:4,"rep", "run"), sep = "_", remove = F) %>%
unite(`1`,`2`,`3`,`4`, run, col = "original_sample") -> cleaned.tibble
# Do all samples have a name?
cleaned.tibble %>% filter(sample == "") # YES
cleaned.tibble %>% filter(original_sample == "") # YES
cleaned.tibble %>% filter(is.na(Hash)) # YES
cleaned.tibble %>% summarise(n_distinct(sample), # 325
                             n_distinct(Hash)) # 3166
cleaned.tibble %>%
  group_by(original_sample) %>%
  summarise(nrep = n_distinct(sample)) %>%
  #filter (nrep == 2) # 13
  filter (nrep == 1) # 4
Ok, so there are 4 samples for which we only have 1 PCR replicate and 13 samples for which we only have 2 PCR replicates. We will get rid of those with only 1, as we can’t estimate the PCR bias there.
discard.1 <- cleaned.tibble %>%
group_by(original_sample) %>%
mutate(nrep = n_distinct(sample)) %>%
#filter (nrep == 2) # 25
filter (nrep == 1) %>%
distinct(sample) %>% pull(sample)
cleaned.tibble %>%
filter(!sample %in% discard.1) -> cleaned.tibble
Anyway, let’s have a visual representation of the dissimilarities between PCR replicates, biological replicates and everything else.
cleaned.tibble %>%
group_by (sample) %>%
mutate (Tot = sum(nReads),
Row.sums = nReads / Tot) %>%
group_by (Hash) %>%
mutate (Colmax = max (Row.sums),
Normalized.reads = Row.sums / Colmax) -> cleaned.tibble
tibble_to_matrix <- function (tb) {
tb %>%
group_by(sample, Hash) %>%
summarise(nReads = sum(Normalized.reads)) %>%
spread ( key = "Hash", value = "nReads", fill = 0) -> matrix_1
samples <- pull (matrix_1, sample)
matrix_1 %>%
ungroup() %>%
dplyr::select ( - sample) -> matrix_1
data.matrix(matrix_1) -> matrix_1
dimnames(matrix_1)[[1]] <- samples
vegdist(matrix_1) -> matrix_1
}
tibble_to_matrix (cleaned.tibble) -> all.distances.full
#names(all.distances.full)
summary(is.na(names(all.distances.full)))
Mode FALSE
logical 320
Let’s make the pairwise distances a long table.
as_tibble(as.matrix(all.distances.full) , rownames = "Sample1") %>%
gather(-Sample1, key = "Sample2", value = "value") -> all.distances.melted
summary(is.na(all.distances.melted$value))
Mode FALSE
logical 102400
# Now, classify each distance: PCR replicates, biological replicates, same site, or different site
all.distances.melted %>%
separate (Sample1, into = c("Site1", "A", "B", "Month1", "rep", "round"), sep = "_", remove = FALSE) %>%
unite (Site1,Month1, col = "Day.site1", remove = FALSE) %>%
unite (Site1, Month1, A, B, round, col = "Bottle1", remove = F) %>%
separate (Sample2, into = c("Site2", "A2", "B2", "Month2", "rep2", "round2"), sep = "_", remove = FALSE) %>%
unite (Site2, Month2, col = "Day.site2", remove = FALSE) %>%
unite (Site2, Month2, A2, B2, round2, col = "Bottle2", remove = F) %>%
mutate ( Distance.type = case_when( Bottle1 == Bottle2 ~ "PCR.replicates",
Day.site1 == Day.site2 ~ "Biol.replicates",
Site1 == Site2 ~ "Same Site",
TRUE ~ "Different Site"
)) %>%
dplyr::select(Sample1 ,Sample2 , value , Distance.type) %>%
filter (Sample1 != Sample2) -> all.distances.to.plot
# Checking all went well
sapply(all.distances.to.plot, function(x) summary(is.na(x)))
Sample1 Sample2 value Distance.type
Mode "logical" "logical" "logical" "logical"
FALSE "102080" "102080" "102080" "102080"
all.distances.to.plot$Distance.type <- all.distances.to.plot$Distance.type %>% fct_relevel( "PCR.replicates", "Biol.replicates", "Same Site")
ggplot (all.distances.to.plot , aes (fill = Distance.type, x = value)) +
geom_histogram (position = "dodge", stat = 'density', alpha = 0.9) +
# facet_wrap( ~ Distance.type) +
labs (x = "Pairwise dissimilarity", y = "density" ,
Distance.type = "Distance")
Ignoring unknown parameters: binwidth, bins, pad
# Instead of choosing based on the pairwise distances, we can do a similar thing using the distance to centroid
# Find out which samples have only two pcr replicates
cleaned.tibble %>% dplyr::select(-Miseq_run) %>% group_by(original_sample) %>% nest() -> nested.cleaning
nested.cleaning %>%
mutate(matrix = map(data, tibble_to_matrix)) -> nested.cleaning
nested.cleaning %>% mutate(ncomparisons = map(matrix, length)) -> nested.cleaning
dist_to_centroid <- function (x,y) {
biol <- rep(y, length(x))
if (length(biol) == 1) {
output = rep(x[1]/2,2)
names(output) <- attr(x, "Labels")
}else{
dispersion <- betadisper(x, group = biol)
output = dispersion$distances
}
output
}
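A quick toy illustration of what the helper computes (assumes the vegan package, which also provides the `vegdist` call used above):

```r
library(vegan)

# Three toy PCR replicates with completely disjoint compositions,
# so every pairwise Bray-Curtis dissimilarity equals 1
m <- matrix(c(1, 0, 0,
              0, 1, 0,
              0, 0, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("rep1", "rep2", "rep3"), c("h1", "h2", "h3")))
d <- vegdist(m)

# betadisper() gives each replicate's distance to the group centroid;
# by symmetry the three distances here are identical
dispersion <- betadisper(d, group = rep("toy", 3))
dispersion$distances
```

With only two replicates there is a single pairwise distance, and `betadisper` cannot be applied; that is why the `length(biol) == 1` branch of `dist_to_centroid` above assigns each of the two replicates half of their pairwise dissimilarity instead.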
nested.cleaning <- nested.cleaning %>% mutate (distances = map2(matrix, original_sample, dist_to_centroid))
unlist (nested.cleaning$distances) -> all_distances
#normparams <- fitdistr(all_pairwise_distances, "normal")$estimate
normparams <- MASS::fitdistr(all_distances, "normal")$estimate
# probs <- pnorm(all_pairwise_distances, normparams[1], normparams[2])
probs <- pnorm(all_distances, normparams[1], normparams[2])
outliers <- which(probs>0.95)
discard <-names (all_distances[outliers])
all_distances
(printed output: a named numeric vector giving, for each PCR replicate, its distance to the group centroid, e.g. CI_Ac_6_Jul_A_r1 = 0.242; values range from about 0.10 to 0.71. Full listing omitted.)
to_write_discarded <- tibble(sample = discard, distance_to_centroid = all_distances[outliers])
to_write_discarded <- to_write_discarded %>% bind_rows(tibble(sample = discard.1,
distance_to_centroid = NA))
Finally, let’s remove these samples from the dataset
ASV.nested %>%
mutate(Step4.tibble = map (Step3.tibble, ~ filter(.,! sample %in% to_write_discarded$sample))) -> ASV.nested
ASV.nested %>%
transmute(Miseq_run, Summary.1 = map(Step4.tibble, ~ how.many(ASVtable = .,round = "5.PCR.dissimilarity"))) %>%
left_join(ASV.summary) %>%
mutate(Summary = map2(Summary, Summary.1, bind_rows)) %>%
dplyr::select(-Summary.1) -> ASV.summary
Joining, by = "Miseq_run"
We will export the final cleaned table with four columns (Miseq_run, sample, Hash, nReads)
ASV.nested %>% unnest(Step4.tibble) %>% mutate(nReads = as.integer(nReads)) %>% write_csv("_data/ASV_table_all_together.csv")
ASV.nested %>% unnest(Step4.tibble) %>% distinct(Hash) %>% left_join(Hash.key) %>% write_csv("_data/Hash_Key_all_together.csv")
library(tidyverse)
library(seqinr)
input <- read_csv("_data/Hash_Key_all_together.csv")
output <- "_data/Hash_Key_all_together.fasta"
write.fasta(sequences = as.list(input$Sequence), names = as.list(input$Hash), file.out = output)
saveRDS(ASV.nested, file = ".ASVs.nested")
Let’s check out the success of our approach: we will use the taxonomy annotation to look for vertebrate sequences.
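The `Taxonomy` object joined in the chunk below is not created anywhere in this document; presumably it is read from the annotation output of the classification step. A placeholder sketch (the file name is hypothetical):

```r
# Hypothetical file name: a per-Hash annotation with at least Hash and Family
Taxonomy <- read_csv("hash.annotated.csv") %>%
  dplyr::select(Hash, Family)
```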
23.558632/28.289321
[1] 0.8327747
ASV.summary %>%
unnest() %>%
filter (Miseq_run!= "4") %>%
group_by(Stage, Stat) %>%
summarise(number = sum(number)) %>%
spread(key = Stat , value = number)
# Where are the salmon?
ASV.nested %>%
unnest(Step4.tibble) %>% #
left_join(Taxonomy) %>%
separate(sample, into = "biol", sep = "\\.", remove = F) %>%
group_by(Family, biol) %>%
summarise (tot = sum(nReads)) %>%
filter (Family == "Salmonidae") %>%
arrange(desc(tot))
# Prevalence plot as in decontam package
ASV.table %>%
group_by(source, Hash) %>%
summarise(ocurrences =n()) %>%
spread(key = source, value = ocurrences, fill = 0) %>%
left_join(Hashes.to.remove.step2) %>%
mutate(origin = case_when(is.na(origin) ~ "Kept",
TRUE ~ "Discarded")) %>%
ggplot(aes(x = Positives, y = Samples, color = origin))+
geom_point()
Hashes.to.remove.step2 %>%
filter (Hash == "acebcd5c491bb273f3e4d615cafad649")